Abstract: Erasure coding is a storage-efficient alternative 
to replication for achieving reliable data backup in distributed 
storage systems. During the storage process, traditional erasure 
codes require a unique source node to create and upload all the 
redundant data to the different storage nodes. However, such a 
source node may have limited communication and computation 
capabilities, which constrain the storage process throughput. 
Moreover, the source node and the different storage nodes 
might not be able to send and receive data simultaneously - 
e.g., nodes might be busy in a datacenter setting, or simply be 
offline in a peer-to-peer setting - which can further threaten 
the efficacy of the overall storage process. In this paper we 
propose an "in-network" redundancy generation process that 
leverages the self-repairing property of the novel SRC 
codes. This in-network redundancy generation allows storage 
nodes to generate new redundant data by exchanging partial 
information among themselves, improving the throughput of 
the storage process. The process is carried out asynchronously, 
utilizing spare bandwidth and computing resources from the 
storage nodes. We analytically show that the performance 
of this technique relies on an efficient usage of the spare 
node resources, and we derive a set of scheduling algorithms 
to maximize the same. We experimentally show that our 
algorithms can, depending on the environment characteristics, 
increase the throughput of the storage process significantly with 
respect to the classical naive storage approach. 

Keywords: distributed storage; erasure codes; backup 

I. Introduction 

There is a continued and rapid global growth in data 
storage needs. Archival and backup storage form a specific 
niche, of importance to both businesses and individuals. A 
recent market analysis from IDC¹ stated that the global 
revenue of the data archival business is expected to reach 
$6.5 billion in 2015. The necessity to cost-effectively scale- 
up data backup systems to meet this ever growing storage 
demand poses a challenge to storage systems designers. 

When large volumes of data are involved, deploying a 
networked distributed storage system becomes essential, since a 
single storage node cannot scale. Furthermore, distribution 
provides opportunities for fault tolerance and parallelized 

¹ http://www.idc.com/getdoc.jsp?containerId=230762 



I/O. Examples of such distributed storage systems are readily 
found in datacenter environments, including distributed file 
systems such as GFS [1] or HDFS [2], distributed 
key-value stores like Dynamo [3] or Cassandra [4]; as well as in 
ad-hoc end user resource based peer-to-peer (P2P) settings 
such as OceanStore [5] and friend-to-friend (F2F) storage 
systems [6], [7], a special kind of peer-to-peer systems often 
considered particularly suitable for personal data backup. 

An important design aspect in distributed storage systems 
is redundancy management. Data replication provides a 
simple way to achieve high fault-tolerance, while erasure 
codes such as Reed-Solomon codes [8] are more sophisticated 
alternatives, capable of significantly reducing the data 
storage footprint for different levels of fault-tolerance 
[9]-[12]. Various trade-offs in adopting erasure codes in storage 
systems, such as storage overhead & fault-tolerance, access 
frequency & decoding overheads and repair bandwidth & 
repair time after failures, have been studied in the literature, 
revealing that erasure codes are particularly 
suitable for backup and archival storage, where data access 
is infrequent, and hence the effects of decoding are marginal. 
A relatively unexplored aspect of the usage of erasure codes 
in storage systems is the time required to insert new data. 

When using replication, a source node aiming to store new 
data uploads one replica of this data to the first storage node, 
which can concurrently forward the same data to a second 
storage node, and so on. Doing so, the load for redundancy 
insertion can be shared, and the source node may not need 
to upload any redundant information itself. Replication thus 
naturally supports "in-network" generation of redundancy, 
that is, generation of new redundancy within the network, 
through data exchange among storage nodes, which in 
turn leads to fast insertion of data. In contrast, in erasure 
encoded systems, the source node is the one responsible for 
computing and uploading all the encoded redundant data to 
the corresponding storage nodes. The amount of data the 
source node uploads is then considerably larger, resulting 
in longer data insertion times. Insertion latency may further 
be exacerbated when the source node and the set of storage 



nodes have additional (mismatched) temporal constraints on 
resource availability, in which case in-network redundancy 
generation can provide partial mitigation. 

We elaborate the effect of temporal resource 
(un)availability issues with two distinct example scenarios, 
which we also use later in our experiments to determine 
how (much) in-network redundancy generation may help: 

• In datacenters, storage nodes might be used for 
computation processes which require efficient access to local 
disks. Since backup processes consume large amounts 
of local disk I/O, system administrators might want to 
avoid backup transfers while nodes are executing I/O 
intensive tasks - e.g., MapReduce tasks. 

• In F2F backup applications, users exchange some of 
their spare disk resources with their friends in order to 
realize a collaborative data backup service. However, 
different users may be online at different times of the 
day. 

In both cases the insertion of new redundancy by the 
source node is restricted to the periods when the availability 
windows of the source overlap those of the storage nodes. 

Unlike replication, where in-network redundancy 
generation is achieved trivially, traditional erasure codes are not 
amenable to it. Our network coding inspired solution is based on 
a novel family of erasure codes called Self-Repairing Codes 
(SRC) [13], recently proposed in order to improve data 
repair efficiency. A salient property of SRC codes is that the 
encoded data stored at each node can be easily regenerated 
by using information from a few other storage nodes. As 
we will show, this is the key property to achieve in-network 
redundancy generation. However, SRC codes have strict 
constraints on how storage nodes can combine their data 
to generate content for other nodes, which, along with the 
temporal availability constraints of nodes, complicates the 
design of efficient in-network redundancy generation. 

The main contributions of this paper are as follows: (i) we 
introduce the concept of in-network redundancy generation 
for reducing data insertion latency in erasure code based 
storage systems, (ii) we define an analytical framework to 
explore valid transfer schedules, (iii) we show that determin- 
ing optimal schedules is computationally intractable, and 
(iv) we propose a set of heuristic algorithms for efficient 
in-network redundancy generation. 

We experimentally validate these ideas using real 
availability traces from friend-to-friend (F2F) [6] and 
peer-to-peer (P2P) [14] applications, and synthetic traces to explore 
datacenter-like scenarios and show that these algorithms 
substantially increase the throughput of the backup process. 



II. Related Work 

The P2P research community has long studied the appli- 
cability of erasure codes in low availability environments 
with limited storage capacity Q, ifTTI . More recently, there 
has been a growing interest in applying erasure codes in 
datacenters to reduce storage costs [15]-[18]. 

While I/O and bandwidth are well recognized critical 
bottlenecks for the storage of huge amounts of data, ex- 
isting literature does not explore how data insertion can be 
optimized in the context of erasure codes based storage. We 
instead identify and discuss some peripherally related works. 

Decentralized erasure codes [19] explored in the context 
of sensor networks [20], [21] are arguably the closest related 
works. In such settings, the (disjoint) data generated by k 
sensors is jointly and redundantly stored by n storage sensors 
based on erasure coding principles, where the data is re- 
distributed among the storage (sensor) nodes using network 
coding [22], a popular mechanism deployed to improve the 
throughput utilization of a given network topology. This line 
of work did not explore the effect of temporal unavailability 
of nodes during the redundancy generation process, and also 
does not map readily to the scenario of one data source 
injecting data to other storage nodes, as studied in this work. 

Benefits of network coding have further been exploited 
in [12] to restore encoded fragments (lost due to failures) 
in a distributed manner, while [13] achieves the same by 
instead designing customized Self-Repairing codes. This 
work leverages Self-Repairing codes in order to carry 
out in-network redundancy generation for opportunistically 
speeding up the data insertion and backup process. 

III. Erasure Codes and Self-Repairing Codes 

In this section, we provide some background on erasure 
codes as classically used for distributed storage, as well as on 
the newly introduced class of Self-Repairing Codes (SRC). 

A. Erasure codes for distributed storage. 

A classical (n, k) erasure code allows one to redundantly 
encode an object of size M into n redundant fragments of 
size M/k, each to be stored in a different storage node. 
The data storage overhead (or redundancy factor) is then 
given by n/k, and the stored object can be reconstructed by 
downloading an amount of data equal to M, from k or more 
different nodes out of n. 

One of the main drawbacks of using classical erasure 
codes for storage is that redundant fragments can only be 
generated by applying coding operations on the original 
data. The generation of new redundancy is then restricted to 
nodes that possess the original object (or a copy), namely: 
the source node, storage nodes that previously reconstructed 
the original object, or possibly were storing a copy (as 
is the case in a hybrid model where a full copy of the 
object is kept, together with encoded fragments). When 
the original raw object is not available, repairing a single 



node failure consequently entails downloading an amount 
of information equivalent to the size of the original object, 
causing a significant communication overhead. 

In order to mitigate this communication overhead, a new 
family of erasure codes called Regenerating Codes [12] was 
recently designed by adopting ideas from network coding. 
The main advantage of Regenerating Codes is that new 
redundant fragments can be generated by downloading an 
amount of data $\beta$ from $d$ other redundant fragments, where 
$d > k$ and $\beta < M/k$. Unlike classical erasure codes, 
Regenerating Codes can thus repair missing fragments by 
downloading only an amount of data equal to $d\beta$, where 
usually $d\beta \ll M$. However, the biggest communication 
savings occur when $d > k$ is a large value, in which case 
the probability of finding $d$ available nodes might be 
very low, limiting the practicality of such codes. 

B. Homomorphic Self-Repairing Codes (HSRC). 

Self-Repairing Codes (SRC) [13] are new erasure codes 
designed to minimize the maintenance overhead by reducing 
the number of nodes d required to be contacted to recreate 
lost fragments. A specific family of SRC codes is Homo- 
morphic Self-Repairing codes (HSRC), where two encoded 
fragments can be xored for such regeneration, i.e. d = 2, as 
long as not more than half of the nodes have failed. This 
property makes HSRC codes very suitable for the in-network 
redundancy generation since partial redundant data stored in 
two different nodes can be used to generate data for a third 
node, without requiring the intervention of the source node. 
However, as we will show below, the pairs of nodes used 
for that purpose cannot be arbitrarily chosen. 

Let us recall briefly the construction of HSRC. We denote 
finite fields by $\mathbb{F}$. The cardinality of $\mathbb{F}$ is given by its index, 
that is, $\mathbb{F}_2$ is the binary field with two elements (the two bits 
0 and 1), and $\mathbb{F}_q$ is the finite field with $q$ elements. If $q = 2^m$, 
for some positive integer $m$, we can fix an $\mathbb{F}_2$-basis of $\mathbb{F}_{2^m}$ 
and represent an element $x \in \mathbb{F}_{2^m}$ using an $m$-dimensional 
vector $\mathbf{x} = (x_1, \dots, x_m)$ where $x_i \in \mathbb{F}_2$, $i = 1, \dots, m$. 

Let $\mathbf{o}$ be the object to be stored over a set of $n$ nodes, 
which is represented as a data vector of size $k \times m$ bits, i.e.: 

$$\mathbf{o} = (o_1, \dots, o_k), \quad o_i \in \mathbb{F}_{2^m}.$$

Given these $k$ original elements, the $n$ redundant fragments 
are obtained by evaluating the polynomial 

$$p(X) = \sum_{i=1}^{k} o_i X^{2^{i-1}} \in \mathbb{F}_{2^m}[X] \qquad (1)$$

in $n$ non-zero values $\alpha_1, \dots, \alpha_n$ of $\mathbb{F}_{2^m}$, yielding the 
redundant vector $\mathbf{r}$ of size $n \times m$ bits, i.e.: 

$$\mathbf{r} = (r_1, \dots, r_n), \quad r_i = p(\alpha_i) \in \mathbb{F}_{2^m}.$$

In particular we need the code parameters (n, k) to satisfy 

$$1 < k < n \le 2^m - 1.$$



IV. HSRC Redundancy Generation 

The most important property of HSRC codes is their 
homomorphic property. From [13] we have that: 

Lemma 1: Let $a, b \in \mathbb{F}_{2^m}$ and let $p(X)$ be the polynomial 
defined in (1), then $p(a + b) = p(a) + p(b)$. 
This implies that we can generate a redundant element $r_k = 
p(\alpha_k)$ from $r_i = p(\alpha_i)$ and $r_j = p(\alpha_j)$ iff $\alpha_k = \alpha_i + \alpha_j$. 

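Example 1: Consider a (n = 7, k = 3) HSRC code and 
an object $\mathbf{o} = (o_1, o_2, o_3)$ of size $3 \times 4$ bits, where 
$o_i \in \mathbb{F}_{2^4}$, $i = 1, 2, 3$. We write $o_1 = (o_{11}, o_{12}, o_{13}, o_{14})$, 
$o_2 = (o_{21}, o_{22}, o_{23}, o_{24})$, $o_3 = (o_{31}, o_{32}, o_{33}, o_{34})$, from 
which we compute $p(X) = \sum_{i=1}^{3} o_i X^{2^{i-1}}$. We evaluate 
$p(X)$ in $n = 7$ values of $\mathbb{F}_{2^4}$, represented in vector form 
as $\alpha_1 = (1,0,0,0)$, $\alpha_2 = (0,1,0,0)$, $\alpha_3 = (1,1,0,0)$, 
$\alpha_4 = (0,0,1,0)$, $\alpha_5 = (1,0,1,0)$, $\alpha_6 = (0,1,1,0)$, 
$\alpha_7 = (1,1,1,0)$, yielding: 

$r_1 = p(\alpha_1) = (o_{11} + o_{21} + o_{31},\; o_{12} + o_{22} + o_{32},\; 
o_{13} + o_{23} + o_{33},\; o_{14} + o_{24} + o_{34})$, 

$r_2 = p(\alpha_2) = (o_{14} + o_{23} + o_{31} + o_{34},\; 
o_{11} + o_{14} + o_{23} + o_{24} + o_{31} + o_{32} + o_{34},\; 
o_{12} + o_{21} + o_{24} + o_{32} + o_{33},\; 
o_{13} + o_{22} + o_{33} + o_{34})$, 

$r_3 = p(\alpha_3) = (o_{11} + o_{21} + o_{31} + o_{14} + o_{23} + o_{31} + o_{34},\; 
o_{12} + o_{22} + o_{11} + o_{14} + o_{23} + o_{24} + o_{31} + o_{34},\; 
o_{13} + o_{23} + o_{12} + o_{21} + o_{24} + o_{32},\; 
o_{14} + o_{24} + o_{13} + o_{22} + o_{33})$, 

$r_4 = p(\alpha_4) = (o_{13} + o_{24} + o_{21} + o_{31} + o_{33},\; 
o_{13} + o_{14} + o_{21} + o_{22} + o_{24} + o_{32} + o_{33} + o_{34},\; 
o_{11} + o_{14} + o_{22} + o_{23} + o_{31} + o_{33} + o_{34},\; 
o_{12} + o_{23} + o_{24} + o_{32} + o_{34})$, 

$r_5 = p(\alpha_5) = (o_{11} + o_{13} + o_{24} + o_{33},\; 
o_{12} + o_{13} + o_{14} + o_{21} + o_{24} + o_{33} + o_{34},\; 
o_{11} + o_{13} + o_{14} + o_{22} + o_{31} + o_{34},\; 
o_{12} + o_{14} + o_{23} + o_{32})$, 

$r_6 = p(\alpha_6) = (o_{14} + o_{23} + o_{34} + o_{13} + o_{24} + o_{21} + o_{33},\; 
o_{11} + o_{23} + o_{31} + o_{13} + o_{21} + o_{22} + o_{33},\; 
o_{12} + o_{21} + o_{24} + o_{32} + o_{11} + o_{14} + o_{22} + o_{23} + o_{31} + o_{34},\; 
o_{13} + o_{22} + o_{33} + o_{12} + o_{23} + o_{24} + o_{32})$, 

$r_7 = p(\alpha_7) = (o_{11} + o_{14} + o_{23} + o_{34} + o_{13} + o_{24} + o_{31} + o_{33},\; 
o_{11} + o_{23} + o_{31} + o_{13} + o_{21} + o_{12} + o_{33} + o_{32},\; 
o_{12} + o_{21} + o_{24} + o_{32} + o_{11} + o_{14} + o_{22} + o_{33} + o_{31} + o_{34} + o_{13},\; 
o_{13} + o_{22} + o_{33} + o_{12} + o_{23} + o_{34} + o_{32} + o_{14})$. 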

We can check that 

$$p(\alpha_7) = p(\alpha_1) + p(\alpha_6) = p(\alpha_2) + p(\alpha_5) = p(\alpha_3) + p(\alpha_4),$$
$$p(\alpha_6) = p(\alpha_1) + p(\alpha_7) = p(\alpha_2) + p(\alpha_4) = p(\alpha_3) + p(\alpha_5).$$

Note that we have not used the vector $\alpha = (0, 0, 0, 1)$ here, 
which would have resulted in a longer code $n > 7$. 
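
The relations of Example 1 can be checked numerically. The following is a minimal Python sketch rather than the authors' implementation. It assumes that $\mathbb{F}_{2^4}$ is represented with the irreducible polynomial $x^4 + x + 1$ (an assumption, chosen because it is consistent with the expression of $r_2$ in Example 1), and that a vector $(x_1, \dots, x_4)$ is stored as an integer whose bit $i-1$ is $x_i$, so that $\alpha_i$ is simply the integer $i$. The helper names `gf_mul`, `gf_pow` and `p_eval` are hypothetical.

```python
M_BITS = 4
IRREDUCIBLE = 0b10011  # x^4 + x + 1 (assumed field representation)


def gf_mul(a, b):
    """Multiply two elements of F_{2^4} (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M_BITS):
            a ^= IRREDUCIBLE
    return result


def gf_pow(a, exponent):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, a)
    return result


def p_eval(x, obj):
    """Evaluate p(X) = sum_{i=1..k} o_i * X^(2^(i-1)) at X = x."""
    acc = 0
    for i, o_i in enumerate(obj):  # position i (from 0) has exponent 2^i
        acc ^= gf_mul(o_i, gf_pow(x, 2 ** i))
    return acc


obj = (0b1011, 0b0110, 0b1101)  # arbitrary example object (o_1, o_2, o_3)
r = {i: p_eval(i, obj) for i in range(1, 8)}  # alpha_i is the integer i

# Lemma 1: p(a + b) = p(a) + p(b) for every pair of field elements.
assert all(p_eval(a ^ b, obj) == p_eval(a, obj) ^ p_eval(b, obj)
           for a in range(16) for b in range(16))
# Relations listed in Example 1.
assert r[7] == r[1] ^ r[6] == r[2] ^ r[5] == r[3] ^ r[4]
assert r[6] == r[1] ^ r[7] == r[2] ^ r[4] == r[3] ^ r[5]
```

Addition in $\mathbb{F}_{2^m}$ is a bitwise xor, which is why `^` plays the role of `+` in the checks.
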
We now discuss how HSRC codes operate in two different 
scenarios: (i) when the source introduces data in the system, 
and (ii) during the in-network redundancy generation. 

A. Source Redundancy Generation 

The homomorphic property described in Lemma 1 has 
been introduced to repair node failures, though it can 
similarly serve to generate redundancy from the source. Recall 
from Lemma 1 that $p(a + b) = p(a) + p(b)$, where both $a, b$ 
can be seen as $m$-dimensional binary vectors, by fixing an 
$\mathbb{F}_2$-basis of $\mathbb{F}_{2^m}$. Let us denote this basis by $\{b_1, \dots, b_m\}$. 
Thus $a$ can be written as $a = \sum_{i=1}^{m} a_i b_i$, $a_i \in \mathbb{F}_2$, and by 
virtue of the homomorphic property, we get that 

$$p\left(\sum_{i=1}^{m} a_i b_i\right) = \sum_{i=1}^{m} a_i \, p(b_i).$$



This means that the source only needs to compute 
$p(b_1), \dots, p(b_m)$ for a given basis $\{b_1, \dots, b_m\}$, after which 
all the other encoded fragments are obtained by xoring pairs 
of elements in $\{p(b_1), \dots, p(b_m)\}$. Thus, when using an 
(n, k) HSRC code, the source computes $k$ ($k < m$) encoded 
fragments $r_1 = p(\alpha_1), \dots, r_k = p(\alpha_k)$, where $\alpha_1, \dots, \alpha_k$ 
are linearly independent, for example, $\{\alpha_1, \dots, \alpha_k\} \subset 
\{b_1, \dots, b_m\}$, and then performs the corresponding xoring. 
The source then injects the $n$ encoded fragments in the 
network. 

Example 2: In Example 1 we have $k = 3 < m = 4$, 
and a natural $\mathbb{F}_2$-basis for $\mathbb{F}_{2^4}$ is $b_1 = (1,0,0,0)$, $b_2 = 
(0,1,0,0)$, $b_3 = (0,0,1,0)$, $b_4 = (0,0,0,1)$. The source 
can generate redundancy by first computing $r_1 = p(\alpha_1) = 
p(b_1)$, $r_2 = p(\alpha_2) = p(b_2)$, $r_4 = p(\alpha_4) = p(b_3)$, then 
$r_3 = p(\alpha_3) = p(\alpha_1) + p(\alpha_2)$, $r_5 = p(\alpha_5) = p(\alpha_1) + p(\alpha_4)$, 
$r_6 = p(\alpha_6) = p(\alpha_2) + p(\alpha_4)$ and $r_7 = p(\alpha_7) = p(\alpha_1) + 
p(\alpha_6)$. The $n = 7$ encoded fragments are then ready to 
be sent over the network. Further notice that the set $B = 
\{p(\alpha_1), p(\alpha_2), p(\alpha_4)\}$ can be seen as a basis for the set of 
redundant fragments, since they are linearly independent, 
and can be combined to generate every redundant fragment. 
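
The expansion of Example 2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: fragments are modelled as byte strings with made-up contents, and the function `expand_from_basis` is a hypothetical helper. It relies on the labelling of Example 1, where $\alpha_i$ is the binary expansion of $i$, so that $r_i$ is the xor of the basis fragments selected by the bits set in $i$.

```python
def xor_bytes(x, y):
    """Bytewise xor of two equally sized fragments."""
    return bytes(a ^ b for a, b in zip(x, y))


def expand_from_basis(basis, n=7):
    """Derive r_1..r_n from the basis fragments r_1, r_2, r_4.

    `basis` maps the indices 1, 2, 4 (the alpha values that form the
    basis in Example 2) to the fragments computed by the source.
    """
    size = len(next(iter(basis.values())))
    fragments = {}
    for index in range(1, n + 1):
        acc = bytes(size)  # all-zero fragment
        for basis_index, fragment in basis.items():
            if index & basis_index:  # this basis element is part of alpha_index
                acc = xor_bytes(acc, fragment)
        fragments[index] = acc
    return fragments


# Made-up fragment contents, for illustration only.
basis = {1: b"\x0a\x51", 2: b"\x33\x0f", 4: b"\xc4\x70"}
fragments = expand_from_basis(basis)
```

With this layout the source evaluates the polynomial only three times; the remaining four fragments cost one or two xor operations each.
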

B. In-Network Redundancy Generation 

Let us now consider the case where the source might not 
inject the whole set of $n$ encoded fragments, but only a 
subset $\{r_i,\; i \in I \subset \{1, \dots, n\}\}$ of the encoded fragments. 
We use the triplet notation $(i,j) \vdash k$ to represent the 
possibility to generate the element $r_k$ by xoring $r_i$ and $r_j$, 
$r_k = r_i + r_j$. Note that due to the commutative property 
of the additive operator, triplets $(i,j) \vdash k$ and $(j,i) \vdash k$ can 
be indistinguishably used to denote the same redundancy 
generation process. We denote by $C$ the set with all the 
feasible repair triplets from a set of $n$ redundant elements. 
Finally, let us define the following two sets: 

Definition 1 (out-creation set): Let $O(i)$ be the set of all 
the possible $(i,j) \vdash k$ triplets where fragment $r_i$ is used to 
generate some other fragment: 

$$O(i) = \{ (i,j) \vdash k \mid j = 1, \dots, n,\; j \ne i,\; k \text{ s.t. } r_k = r_i + r_j \}.$$

Definition 2 (in-creation set): Let $I(k)$ be the set of all 
the possible $(i,j) \vdash k$ triplets that can be used to create $r_k$: 

$$I(k) = \{ (i,j) \vdash k \mid r_k = r_i + r_j;\; i, j = 1, \dots, n \}.$$

Finally, given a number of redundant elements $n = 2^t - 1$, 
for any positive integer $t \le m$, we have from [13] that: 

$$|O(i)| = n - 1 \qquad (2)$$
$$|I(k)| = (n - 1)/2. \qquad (3)$$



Example 3: In Example 1 we have that 

$$O(1) = \{(1,3) \vdash 2,\; (1,2) \vdash 3,\; (1,5) \vdash 4,\; (1,4) \vdash 5,\; (1,7) \vdash 6,\; (1,6) \vdash 7\},$$
$$I(7) = \{(1,6) \vdash 7,\; (2,5) \vdash 7,\; (3,4) \vdash 7\}.$$
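
The two sets of Example 3 can be enumerated mechanically. The sketch below is a hedged illustration: it assumes the labelling of Example 1, in which $\alpha_i$ is the binary expansion of $i$, so that $r_k = r_i + r_j$ exactly when $k$ equals the bitwise xor of $i$ and $j$. The function names `out_creation` and `in_creation` are hypothetical, and a triplet $(i,j) \vdash k$ is written as the tuple `(i, j, k)`.

```python
def out_creation(i, n):
    """O(i): triplets (i, j, k) in which fragment r_i helps to create r_k."""
    return [(i, j, i ^ j) for j in range(1, n + 1)
            if j != i and 1 <= (i ^ j) <= n]


def in_creation(k, n):
    """I(k): triplets (i, j, k) that create r_k, one per unordered pair."""
    return [(i, i ^ k, k) for i in range(1, n + 1)
            if i != k and i < (i ^ k) <= n]


n = 7
o1 = out_creation(1, n)
i7 = in_creation(7, n)
# The sizes agree with equations (2) and (3) for n = 7.
assert len(o1) == n - 1
assert len(i7) == (n - 1) // 2
```
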

C. HSRC Codes: Practical Implementation 

Previously we detailed how to encode a data vector $\mathbf{o}$ of 
size $k \times m$ bits into a redundant vector $\mathbf{r}$ of size $n \times m$ bits. 
We showed that HSRC codes allow this encoding by using 
data from the source node as well as by using data from other 
storage nodes. In this subsection we describe one method to 
practically implement HSRC codes to encode larger data 
objects of size $M$, where $M > k \times m$. 

The first step to encode an object of size $M$ is to split 
it into $u = M/(k \times m)$ vectors² of size $k \times m$ bits. Let 
us represent the object to be encoded as $\mathbf{o} = (o_1, \dots, o_M)$. 
After the splitting process, $\mathbf{o} = (\hat{o}_1, \dots, \hat{o}_u)$, where $\hat{o}_i = 
(o_{k(i-1)+1}, \dots, o_{k(i-1)+k})$, $o_{k(i-1)+j} \in \mathbb{F}_{2^m}$, $j = 1, \dots, k$. 
Each of these vectors $\hat{o}_i$ is individually encoded using the 
polynomial (1) to obtain an encoded vector $\hat{r}_i = (\hat{r}_{i,1}, \dots, \hat{r}_{i,n})$, 
$|\hat{r}_i| = n$. Finally, the vector $\mathbf{r} = (r_1, \dots, r_n)$ with the $n$ 
fragments to be stored in the system is obtained by 
concatenating individual elements of $\hat{r}$ so that $r_i = (\hat{r}_{1,i}, \dots, \hat{r}_{u,i})$. 

Example 4: Consider the (n = 7, k = 3) HSRC code and 
the object $\mathbf{o} = (o_1, \dots, o_9)$, where $o_i \in \mathbb{F}_{2^m}$. We split the 
object into $u = 3$ vectors $\mathbf{o} = (\hat{o}_1, \hat{o}_2, \hat{o}_3)$, where $\hat{o}_1 = 
(o_1, o_2, o_3)$, $\hat{o}_2 = (o_4, o_5, o_6)$ and $\hat{o}_3 = (o_7, o_8, o_9)$. After 
encoding each of the individual vectors we obtain the set of 
redundant vectors $\hat{r} = (\hat{r}_1, \dots, \hat{r}_3)$ ($\hat{o}_i$ is encoded to obtain 
$\hat{r}_i$), where $|\hat{r}_1| = |\hat{r}_2| = |\hat{r}_3| = 7$. Finally, we can obtain 
$r_2 = (\hat{r}_{1,2}, \hat{r}_{2,2}, \hat{r}_{3,2})$, and similarly for all the fragments. 

Remark 1: Note that this encoding technique allows 
stream encoding. As soon as the source node receives the 
first $k \times m$ bits to be stored it can generate the vector $\hat{o}_1$, encode 
it to $\hat{r}_1$, and distribute $\hat{r}_{1,1}, \dots, \hat{r}_{1,n}$ to the $n$ storage nodes. 
Similarly, when a storage node receives $\hat{r}_{1,i}$ it can forward 
it to other nodes for in-network redundancy generation, for 
instance, when the source does not have adequate bandwidth 
to upload all the $n$ redundant fragments. 
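
The splitting and regrouping of Example 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the HSRC encoder is replaced by a stand-in, `label_encoder`, that returns textual labels instead of field elements, and all function names are hypothetical.

```python
def split_object(elements, k):
    """Split the object into u vectors of k elements each."""
    if len(elements) % k != 0:
        raise ValueError("zero-pad the object so that k divides its length")
    return [tuple(elements[i:i + k]) for i in range(0, len(elements), k)]


def assemble_fragments(encoded_vectors):
    """Regroup chunks so that node i holds (r^_{1,i}, ..., r^_{u,i})."""
    n = len(encoded_vectors[0])
    return [tuple(vector[node] for vector in encoded_vectors)
            for node in range(n)]


def label_encoder(vector_index, n):
    """Stand-in for the HSRC encoder: labels the n chunks of one vector."""
    return [f"r^{vector_index},{node}" for node in range(1, n + 1)]


# Example 4: nine elements, k = 3, hence u = 3 vectors and n = 7 nodes.
vectors = split_object([f"o{i}" for i in range(1, 10)], k=3)
encoded = [label_encoder(v, n=7) for v in range(1, len(vectors) + 1)]
fragments = assemble_fragments(encoded)
```

Because each vector is encoded independently, the chunks of the first vector can be distributed before the later vectors are known, which is the streaming behaviour described in Remark 1.
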

Remark 2: To implement computationally efficient codes 
one can set $m = 8$, or $m = 32$, for which addition can 
simply be done by xoring system words, and for which 
efficient arithmetic libraries are available [23]. 

In the rest of this paper we will assume that HSRC 
codes are implemented using the method described here. 
We will use the term redundant fragment to refer to each 
of the redundant elements $r_1, \dots, r_n$, i.e., each node stores 
one redundant fragment. And similarly we will use the 
² We assume that $(k \times m) \mid M$. Otherwise the object can be zero-padded 
to guarantee it. 

³ A gateway node in a datacenter receiving data from a web application 
end user would be treated as the source node in our model. In such 
scenarios, the source itself may not be in possession of the whole data 
in advance. 



term redundant chunk to refer to each of the sub-elements 
$\hat{r}_{1,i}, \dots, \hat{r}_{u,i}$ stored in each node $i$, i.e., each node $i$ can 
store up to $u$ redundant chunks. 

V. Scheduling the In-Network Redundancy 
Generation 

In-network redundancy generation has the potential to 
speed up the insertion of new data in distributed storage 
systems. However, the magnitude of actual benefit depends 
on two factors: (i) the availability pattern of the source and 
storage nodes, which determines the achievable throughput, 
and (ii) the specific schedule of data transfer among nodes 
subject to the constraints of resource availability, which 
determines the actual achieved throughput for data backup. 
In this section, we explore the scheduling problem, demon- 
strating that finding an optimal schedule is computationally 
very expensive even with a few simplifying assumptions, 
and accordingly motivate some heuristics instead. 

Let $s$ be a source node aiming to store a new data object to 
$n$ different storage nodes, and let $i$, $i = 1, \dots, n$, represent 
each of these $n$ storage nodes. We model our system 
using discrete time steps of duration $\tau$, where at each time 
step nodes can be available or unavailable to send/receive 
redundant data. The binary variable $a(i,t) \in \{0, 1\}$ denotes 
this availability for each node $i$ for the corresponding time 
step $t$. Using this binary variable we can define the maximum 
amount of data that node $i$ can upload during time step $t$ by 

$$u(i,t) = a(i,t) \cdot \omega_i^{\uparrow}(t) \cdot \tau,$$

where $\omega_i^{\uparrow}(t)$ is the upload capacity of node $i$ during time 
step $t$. Similarly, the amount of data each node can download 
during time step $t$ is given by 

$$d(i,t) = a(i,t) \cdot \omega_i^{\downarrow}(t) \cdot \tau,$$

where $\omega_i^{\downarrow}(t)$ represents the download capacity of node $i$ 
during time step $t$. 

Then, we define the in-network redundancy generation 
network as a weighted temporal directed graph $G = 
(E(t), V(t))$, $t \ge 0$, with the set of nodes $V(t) \subseteq 
\{s, 1, 2, \dots, n\}$, and the set of edges $E(t) = \{(i,j) \mid i, j \in 
V(t)\}$. The amount of data that nodes send among 
themselves is a mapping $f : E(t) \to \mathbb{R}^+$, denoted by $f(i,j,t)$, 
$\forall (i,j) \in E(t)$, $t \ge 0$. 

HSRC code characteristics constrain the mapping $f$ since 
nodes can only send or receive data through valid redundancy 
creation triplets: 

$$f(i,k,t) > 0 \iff \exists\, c \in C \text{ s.t. } c = (i,j) \vdash k. \qquad (4)$$

Furthermore, we assume (for algorithmic simplicity) that 
nodes send data through each of the redundancy generation 
triplets symmetrically: 

$$R(c,t) = f(i,k,t) = f(j,k,t), \quad \forall c \in C;\; c = (i,j) \vdash k. \qquad (5)$$



For ease of notation we will refer to the data sent through 
each of the redundancy generation triplets simply by R{c, t). 

Similarly, because of the upload/download bandwidth 
constraints, the mapping / must also satisfy the following 
constraints: 

• The amount of data the source uploads is constrained 
by its upload capacity: 

$$\sum_{i=1}^{n} f(s,i,t) \le u(s,t); \quad \forall t \ge 0. \qquad (6)$$

• The amount of data storage nodes upload is also 
constrained by their upload capacity: 

$$\sum_{c \in O(i)} R(c,t) \le u(i,t); \quad \forall i \in V(t). \qquad (7)$$

• The amount of data storage nodes download is 
restricted by their download capacity: 

$$f(s,i,t) + 2 \sum_{c \in I(i)} R(c,t) \le d(i,t); \quad \forall i \in V(t). \qquad (8)$$

A Bandwidth-Valid In-Network Redundancy Generation 
Scheduling is any mapping $f$ on $G$ that satisfies the 
constraints defined in equations (4), (5), (6), (7) and (8). 
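
As an illustration, the capacity constraints (6), (7) and (8) can be checked for a single time step with a few lines of Python. This is a sketch under stated assumptions rather than part of the original framework: transfers are stored per triplet, which makes constraints (4) and (5) hold by construction, an unavailable node is modelled by zero capacities, and the function name `is_bandwidth_valid` and the dictionary layout are hypothetical.

```python
def is_bandwidth_valid(f_src, R, u, d, tol=1e-9):
    """Check constraints (6)-(8) for one time step.

    f_src: {node: data sent by the source to that node}
    R:     {(i, j, k): data sent by i and by j to k through (i, j) |- k}
    u, d:  upload / download limits per node; u also holds the source "s"
    """
    nodes = [i for i in u if i != "s"]
    # (6): total upload of the source
    if sum(f_src.get(i, 0) for i in nodes) > u["s"] + tol:
        return False
    for i in nodes:
        # (7): node i uploads R(c) for every triplet in its out-creation set
        uploaded = sum(amount for (a, b, _), amount in R.items()
                       if i in (a, b))
        # (8): node i downloads from the source and twice R(c) per in-triplet
        downloaded = f_src.get(i, 0) + 2 * sum(
            amount for (_, _, k), amount in R.items() if k == i)
        if uploaded > u[i] + tol or downloaded > d[i] + tol:
            return False
    return True


u = {"s": 3, 1: 1, 2: 1, 3: 1}
d = {1: 2, 2: 2, 3: 2}
valid = is_bandwidth_valid({1: 1, 2: 1}, {(1, 2, 3): 1}, u, d)
too_much = is_bandwidth_valid({1: 1, 2: 1}, {(1, 2, 3): 1.5}, u, d)
```
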

A. Optimal Schedule 

Let $\theta(i,t)$ be the amount of data that node $i$ had received 
at the end of time step $t$. For sufficiently large 
files and small values of $m$ (e.g. $m = 8$), we can assume 
without loss of generality that $\theta(i,t)/m$ corresponds to the 
index of the last redundant chunk received by node $i$. Let 
$M(t)$ denote the size of the largest possible file that a 
schedule $f$ can store in $t$ time steps. Then, by definition 
of erasure codes, to consider that a file of size $M(t)$ has 
been successfully stored after $t$ time steps, each node must 
receive an amount of data equal to $M(t)/k$. Using this fact 
we can define $M(t)$ as: 

$$M(t) = \min\left(\theta(1,t), \dots, \theta(n,t)\right) \times mk.$$

For a given network $G$ and a duration $t$, an in-network 
redundancy generation scheduling $f$ is then optimal if it 
maximizes $M(t)$. 

Note that after $t$ time steps, the overall network traffic 
required by any schedule $f$, namely $T(f,t)$, is equal to: 

$$T(f,t) = \sum_{t'=0}^{t} \left( 2 \sum_{c \in C} R(c,t') + \sum_{i=1}^{n} f(s,i,t') \right).$$

Accordingly, we define an in-network redundancy 
generation schedule $f$ to be an optimal minimum-traffic schedule 
if besides maximizing $M(t)$, it also minimizes $T(f,t)$. 

Remark 3: Note that to create the same amount of new 
redundancy the in-network redundancy generation requires 
twice the traffic required by the source redundancy genera- 
tion. 







Figure 1. Example of 3 different in-network redundancy scheduling 
policies for a system where the source node can only upload data con- 
currently to 3 different nodes. Although the 3 different schedules satisfy 
the constraints, only scheduling policy 3 is valid. 



B. Additional Scheduling Constraints 

In this section we show that while being a bandwidth-valid 
schedule is a necessary condition, it is not a sufficient 
condition for the schedule to be actually valid. For that, we 
will use Example 4, an in-network redundancy generation 
network using an HSRC code with parameters (n = 7, k = 3), 
where the redundant fragments $r_1, \dots, r_7$ have to be stored 
in nodes 1, . . . , 7 respectively. Recall also that each 
redundant fragment is composed of 3 redundant chunks, hence 
$|r_i| = 3$. For ease of notation we will assume that each 
redundancy generation triplet $c = (i,j) \vdash k$, $c \in C$, satisfies 
the property $k = i \oplus j$ where $\oplus$ denotes the bitwise xor 
operation. Based on this example we consider three different 
scheduling policies, all depicted in Figure 1. We assume that 
due to the limited upload capacity of the source node it can 
only upload three redundant fragments simultaneously. 

In the first scenario, at time $t = 0$ the source node sends 
to nodes 1, 2 and 3 their first redundant chunk; at 
time $t = 1$ it does the same for nodes 5, 6 and 7. Note 
that if at time step $t = 1$ the mapping $f$ tries to make use 
of the in-network redundancy generation triplets $(1,6) \vdash 7$, 
$(2,7) \vdash 5$ and $(3,5) \vdash 6$, nodes 5, 6 and 7 end up receiving 
the same redundant fragment twice. In this case the 
in-network redundancy traffic does not contribute to speeding 
up the backup process and only consumes communication 
resources. Although avoiding this problem is implicit in the 
definition of a minimum-traffic scheduling, it needs to be 
explicitly considered during the scheduling. 

Consider a second scheduling policy trying to solve the 
previous problem by sending to nodes 5, 6 and 7 the second 
chunk instead of the first. It allows these nodes to receive 
two different fragments by time $t = 1$. However, a 
circular dependency problem arises with triplets $(1,6) \vdash 7$, $(2,7) \vdash 5$ 
and $(3,5) \vdash 6$. To show this dependency, imagine that we want 
to generate fragment $r_{6,1}$ using non-source data. Note that 
$r_{6,1}$ requires $r_{5,1}$, and $r_{5,1}$ requires $r_{7,1}$, which in turn 
requires the fragment we aim to generate, $r_{6,1}$. Although 
it is a bandwidth-valid schedule, the circular dependency 
problem makes it an infeasible schedule. 

Finally, in the third case we see how the circular de- 
pendency problems can be avoided if the source sends 
uncorrelated fragments at each time step. It is easy to see 
from this example that a valid schedule needs to be not 
only bandwidth-valid, but also ensure that: (i) nodes do not 
receive duplicated data, and (ii) circular triplet dependencies 
are prevented. 
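
The two extra conditions can be tested per chunk with a simple fixed-point iteration. The sketch below is an illustration under stated assumptions, not the authors' procedure: `check_chunk_schedule` is a hypothetical helper that looks at a single chunk index, takes the set of nodes that obtain that chunk from the source and the list of triplets `(i, j, k)` scheduled for it, and reports the first problem it finds. The third test case is a made-up valid configuration, since the exact content of policy 3 is only given in Figure 1.

```python
def check_chunk_schedule(source_nodes, triplets):
    """Classify a per-chunk schedule as 'ok', 'duplicate' or 'unresolvable'."""
    have = set(source_nodes)
    # (i) a triplet must not target a node already served by the source
    if any(k in have for (_, _, k) in triplets):
        return "duplicate"
    pending = list(triplets)
    while pending:
        # triplets whose two inputs are already available
        ready = [c for c in pending if c[0] in have and c[1] in have]
        if not ready:
            # (ii) every remaining triplet waits for another one
            return "unresolvable"
        for c in ready:
            if c[2] in have:
                return "duplicate"  # two triplets create the same chunk
            have.add(c[2])
            pending.remove(c)
    return "ok"


ring = [(1, 6, 7), (2, 7, 5), (3, 5, 6)]
policy_1 = check_chunk_schedule({1, 2, 3, 5, 6, 7}, ring)
policy_2 = check_chunk_schedule({1, 2, 3}, ring)
hypothetical = check_chunk_schedule(
    {1, 2, 4}, [(1, 2, 3), (1, 4, 5), (2, 4, 6), (1, 6, 7)])
```
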

C. Complexity Analysis 

We show that finding an optimal schedule satisfying all 
the previous requirements is computationally very expensive, 
even under further simplifying assumptions: 

Assumption 1: The amount of data that the source node $s$ 
sends during each time step $t$ to any storage node $i$, $f(s,i,t)$, 
is a constant value and is not part of the optimization 
problem. 

Assumption 2: Storage nodes can only receive redundant 
chunks sequentially. It means that node $i$ will never receive 
chunk $\hat{r}_{j+1,i}$ before previously receiving chunk $\hat{r}_{j,i}$. 

It is easy to see that the simplified problem subject to 
these two assumptions corresponds to a specific instance of 
the generic case described above. The interesting property 
of this simplified version of the problem is that we can 
reduce the decision of choosing the optimal schedule $f$ to 
an algorithm "SortedVector" which sorts $C$, as shown in 
Algorithm 1. It is also easy to see that, due to the iterative 
use of redundancy generation triplets, Algorithm 1 avoids 
both the "duplicate data" and the "circular dependencies" 
problems. However, since $|C| = n(n - 1)$, 
there are $(n(n - 1))!$ possible ways of sorting $C$, and thus, 
$t \times (n(n - 1))!$ different scheduling possibilities. Thus, a 
brute force algorithm to determine the best schedule would 
have a $O(n!)$ cost. 

If we focus on a single time step t, then the schedul- 
ing problem can be restated as how to choose the best 



Algorithm 1 Creating a valid optimal schedule under 
Assumption 1 & Assumption 2 

for i in 1, 2, . . . , n do 
    θ(i, 0) ← 0 
end for 
for t in 0, 1, 2, . . . , t do 
    for i in 1, 2, . . . , n do 
        θ(i, t) ← θ(i, t) + f(s, i, t) 
    end for 
    triplets ← SortedVector(C) 
    for z in 1, . . . , |C| do 
        c ← triplets[z] 
        /* c = (i, j) ⊢ k */ 
        availDataByBw ← min(u(i, t), u(j, t), d(k, t)) 
        availDataByIndex ← min(θ(i, t), θ(j, t)) − θ(k, t) 
        availDataByIndex ← max(availDataByIndex, 0) 
        availData ← min(availDataByBw, availDataByIndex) 
        R(c, t) ← availData 
        θ(k, t) ← θ(k, t) + availData 
    end for 
end for 
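
A direct Python transcription of Algorithm 1 may help to follow it. It is a sketch under stated assumptions: the received data $\theta$ is carried over from one time step to the next, the ordering produced by "SortedVector" is passed in as a ready-made list, and, like the pseudocode, the sketch does not decrease a node's remaining bandwidth after it takes part in a triplet. The name `run_schedule` and the callable-based interface are hypothetical.

```python
def run_schedule(n, steps, f_src, u, d, sorted_triplets):
    """Apply the triplets in the given order at every time step.

    f_src(i, t), u(i, t), d(i, t) are callables returning data amounts;
    sorted_triplets is a list of (i, j, k) tuples, most preferred first.
    Returns theta (data received per node) and R (data per triplet and step).
    """
    theta = {i: 0 for i in range(1, n + 1)}
    R = {}
    for t in range(steps):
        # data injected by the source during this step
        for i in range(1, n + 1):
            theta[i] += f_src(i, t)
        for (i, j, k) in sorted_triplets:
            by_bandwidth = min(u(i, t), u(j, t), d(k, t))
            # k can only receive what both i and j hold and k still lacks
            by_index = max(min(theta[i], theta[j]) - theta[k], 0)
            amount = min(by_bandwidth, by_index)
            R[((i, j, k), t)] = amount
            theta[k] += amount
    return theta, R


# Toy run: the source feeds nodes 1 and 2 only; node 3 is filled in-network.
theta, R = run_schedule(
    n=3, steps=2,
    f_src=lambda i, t: 1 if i in (1, 2) else 0,
    u=lambda i, t: 10, d=lambda i, t: 10,
    sorted_triplets=[(1, 2, 3)])
```

In the toy run node 3 never receives data from the source, yet it keeps pace with nodes 1 and 2 because the triplet (1, 2, 3) is served at every step.
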




Figure 2. Example of a permutation tree to implement "SortedVector" 
in Algorithm 1. We assume in this example a hypothetical system where 
$|C| = 4$. 

permutation of $C$. We can represent this decision problem
using a permutation tree, as depicted in Figure 2. The
weight of each edge in this permutation tree corresponds to
the negative of the amount that choosing that edge contributes
to $M(t)$. Choosing the best schedule is then the same
as finding the shortest path between vertices v and d in
the permutation tree. The Bellman-Ford algorithm can find
the shortest path with cost $O(|E| \cdot |V|)$, where $|E|$ and
$|V|$ respectively represent the number of edges and vertices
in the permutation tree. However, in our permutation tree the
number of edges and the number of vertices are both
$(n(n-1))!$, which makes finding the optimal schedule
computationally prohibitive even for the simplified problem
and even for a small number of nodes $n$. Hence, we consider
the general problem described in IV-A to be also intractable.
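
The dependence on the triplet order can be illustrated with a small brute-force search over a single time step. The sketch below mirrors the inner loop of Algorithm 1 (as in the pseudocode, bandwidth budgets are not decremented between triplets) and tries every permutation of a tiny, hypothetical set of triplets. The node identifiers, bandwidth values and function names are assumptions made only for illustration.

```python
from itertools import permutations


def run_step(order, u, d, theta):
    """Apply the inner loop of Algorithm 1 to one ordering of triplets.

    order: sequence of (i, j, k) tuples, read as (i, j) |- k.
    u, d:  upload / download budget of every node in this time step.
    theta: data held by every node before in-network generation.
    Returns (total data generated, updated copy of theta).
    """
    theta = dict(theta)  # work on a copy so that orderings can be compared
    total = 0.0
    for i, j, k in order:
        by_bw = min(u[i], u[j], d[k])
        by_index = max(min(theta[i], theta[j]) - theta[k], 0)
        avail = min(by_bw, by_index)
        theta[k] += avail
        total += avail
    return total, theta


def best_order(triplets, u, d, theta):
    """Exhaustive search: feasible only for a handful of triplets."""
    best = max(permutations(triplets), key=lambda o: run_step(o, u, d, theta)[0])
    return best, run_step(best, u, d, theta)[0]


if __name__ == "__main__":
    nodes = range(1, 8)
    u = {v: 10.0 for v in nodes}
    d = {v: 10.0 for v in nodes}
    theta = {v: 0.0 for v in nodes}
    theta.update({1: 6.0, 2: 6.0, 4: 6.0})
    # Hypothetical chain: the output of one triplet feeds the next one.
    triplets = [(1, 2, 3), (3, 4, 7), (7, 1, 6)]
    print(best_order(triplets, u, d, theta))
    print(run_step(list(reversed(triplets)), u, d, theta)[0])
```

In this toy instance only one of the six orderings lets every triplet generate data, which is the effect the permutation tree is meant to capture; the exhaustive search itself is what becomes infeasible as |C| grows.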

VI. Heuristic Scheduling Algorithms 

In this section we investigate several heuristics for
scheduling the in-network redundancy generation. We split
the scheduling problem into two parts, following the strategy
presented in Algorithm 1.

The heuristics do not require Assumption 1, thus allowing
the source node to send different amounts of data to each
storage node. We however still rely on Assumption 2, which
allows us to model the decision problem with a sorting
algorithm, as previously outlined in Algorithm 1. Thus, the
overall scheduling problem is decomposed into the following
two decisions: (i) How does the source node schedule its
uploads? (ii) How are redundancy generation triplets sorted?

A. Scheduling traffic from the source 

Recall that generating redundancy directly from the source
node involves less bandwidth than doing it with in-network
techniques (Remark 3). Thus, a good source traffic schedule
should aim at maximizing the utilization of the source's
upload capacity. Furthermore, the schedule must also try to
ensure that the source-injected data can be further used for
the in-network redundancy generation.

Given an $(n, k)$ HSRC code, where $n = 2^k - 1$, any
subset of $k$ linearly independent encoded fragments forms
a basis, denoted by $B$ (see Example 2 for an illustration).
Let $\mathcal{B}$ be the set of all the possible bases $B$. Since each
storage node stores one redundant fragment, we use $\mathcal{B}(t)$
to represent all the bases of $\mathcal{B}$ whose corresponding storage
nodes are available at a time step $t$ (and likewise, refer to
each combination of such nodes as an available basis):

$$\mathcal{B}(t) = \{ B \in \mathcal{B} \mid a(i, t) = 1, \; \forall i \in B \}.$$
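
As a rough illustration of these sets, the sketch below enumerates bases and triplets under a simplifying labelling that is an assumption of this example, not something stated in the text: each of the n = 2^k - 1 fragments is identified with a distinct nonzero k-bit vector, linear independence of fragments is taken to match independence of these labels over GF(2), and a triplet (i, j) ⊢ k is taken to satisfy k = i XOR j. All function names are hypothetical.

```python
from itertools import combinations, permutations


def gf2_rank(vectors):
    """Rank over GF(2) of integers interpreted as bit vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clears the leading bit of b in v if it is set
        if v:
            basis.append(v)
    return len(basis)


def hsrc_model(k):
    """Return (bases, triplets) for the assumed labelling with n = 2**k - 1."""
    n = 2 ** k - 1
    nodes = range(1, n + 1)
    bases = [b for b in combinations(nodes, k) if gf2_rank(b) == k]
    triplets = [(i, j, i ^ j) for i, j in permutations(nodes, 2)]
    return bases, triplets


def available_bases(bases, online):
    """B(t): bases whose nodes are all online, i.e. a(i, t) = 1 for all i."""
    return [b for b in bases if all(i in online for i in b)]


if __name__ == "__main__":
    bases, triplets = hsrc_model(3)
    print(len(bases), len(triplets))
    print(available_bases(bases, online={1, 2, 4, 7}))
```

Under this labelling the (7, 3) code yields 42 ordered triplets, which agrees with the count n(n - 1) used earlier.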

From the set of available bases, $\mathcal{B}(t)$, the source node
selects one basis $B$ and uploads some data to each node
$i \in B$. The amount of data the source uploads to each node
$i \in B$ is set to guarantee that at the end of time step $t$ all
these nodes have received the same amount of data,
$\theta(i,t) = \theta(j,t)$, $\forall i, j \in B$. To satisfy this while
maximizing the upload capacity* utilization of the source, the
source needs to send to each node $i \in B$ an amount of data
equal to

$$\frac{1}{k}\Big(u(s,t) + \sum_{j \in B} \theta(j, t-1)\Big) - \theta(i, t-1).$$

* We assume that the upload capacity of the source is less than the
download capacity of the basis nodes.

We considered the following policies for the source node
to select a specific $B$ among all the available bases, $\mathcal{B}(t)$:

• Random: $B$ is randomly selected from $\mathcal{B}(t)$. Repeating
this procedure for several time steps is expected to
ensure that all nodes receive approximately the same
amount of data from the source.



• Minimum Data: The source selects the basis $B$ that on
average has received the least redundant data, i.e.,
the basis that minimizes $\frac{1}{k}\sum_{i \in B} \theta(i, t-1)$. This
policy tries to homogenize the amount of data all nodes
receive.

• Maximum Data: The source selects the basis $B$ that on
average has received the most redundant data, i.e.,
the basis that maximizes $\frac{1}{k}\sum_{i \in B} \theta(i, t-1)$. This
policy tries to have a basis of nodes with enough data
to allow the in-network redundancy generation for the
entire data object even when the source may not be
available.

• No Basis: The source does not consider any basis and
instead uploads data to all the online nodes. The upload
bandwidth of the source is again distributed to guarantee
that, after time step t, all online nodes have received
the same amount of data.
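
The basis-oriented policies above can be sketched as follows. This is a minimal illustration under stated assumptions: `theta_prev` is a hypothetical mapping from node to the data it held at the end of the previous time step, bases are plain tuples of node identifiers, and the per-node share is the amount that equalizes the basis while spending the source's whole upload budget `u_s`.

```python
import random


def upload_shares(basis, theta_prev, u_s):
    """Amount the source sends to each basis node so that all end up equal.

    Note: if one node already holds far more data than the others, its
    share can come out negative; this sketch does not handle that case.
    """
    level = (u_s + sum(theta_prev[j] for j in basis)) / len(basis)
    return {i: level - theta_prev[i] for i in basis}


def select_basis(policy, avail_bases, theta_prev, rng=random):
    """Pick one available basis according to the named source policy."""

    def mean_data(basis):
        return sum(theta_prev[i] for i in basis) / len(basis)

    if policy == "random":
        return rng.choice(avail_bases)
    if policy == "min_data":
        return min(avail_bases, key=mean_data)
    if policy == "max_data":
        return max(avail_bases, key=mean_data)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    theta_prev = {0: 0.0, 1: 3.0, 2: 6.0}
    print(upload_shares((0, 1, 2), theta_prev, u_s=9.0))
    print(select_basis("min_data", [(0, 1), (1, 2)], theta_prev))
```

The shares always sum to `u_s`, so the source's upload capacity is fully used whenever every share is non-negative. The No Basis policy corresponds to calling `upload_shares` with the tuple of all online nodes instead of a basis.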

B. Sorting the redundancy generation triplets 

We explore the following sorting heuristics to answer the 
second question: 

• Random: Repair triplets are randomly sorted. This
policy tries to uniformly distribute the utilization of
network resources to maximize the amount of
in-network generated data.

• Minimum Data: The list of available triplets is sorted
in ascending order according to the amount of data
$\theta(k, t)$ that the destination† node $k$ has received. This
policy tries to prioritize redundancy generation in those
nodes that have received less redundant data.

• Maximum Data: Similar to the Minimum Data policy,
except that triplets are sorted in descending order.
This policy tries to maximize the amount of data some
specific subset of nodes receive, to allow them to
sustain the redundancy generation process even when
the source is not available.

• Maximum Flow: The triplets are sorted in descending
order according to the amount of redundant data these
nodes can help generate. Note that the amount of data
a triplet $c$ can generate at each time step $t$, where
$c = (i,j) \vdash k$, is given by:

$$\min\big(u(i,t),\; u(j,t),\; d(k,t),\; \theta(i,t) - \theta(k,t),\; \theta(j,t) - \theta(k,t)\big).$$

This policy tries to maximize the amount of new
redundancy generated per time step.
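
The four sorting heuristics can be written as alternative implementations of the "SortedVector" step. The sketch below is illustrative only: triplets are hypothetical (i, j, k) tuples read as (i, j) ⊢ k, and `u`, `d` and `theta` are dictionaries holding the per-node upload budget, download budget and received data for the current time step.

```python
import random


def triplet_flow(c, u, d, theta):
    """Data that triplet c = (i, j) |- k can generate in this time step."""
    i, j, k = c
    return min(u[i], u[j], d[k], theta[i] - theta[k], theta[j] - theta[k])


def sort_triplets(policy, triplets, u, d, theta, rng=random):
    """Return the triplets in the order given by the named heuristic."""
    if policy == "random":
        return rng.sample(triplets, len(triplets))
    if policy == "min_data":
        # Ascending by the data already held by the destination node k.
        return sorted(triplets, key=lambda c: theta[c[2]])
    if policy == "max_data":
        return sorted(triplets, key=lambda c: theta[c[2]], reverse=True)
    if policy == "max_flow":
        return sorted(triplets, key=lambda c: triplet_flow(c, u, d, theta),
                      reverse=True)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    u = {1: 10.0, 2: 1.0, 3: 10.0, 4: 10.0, 5: 10.0}
    d = {v: 10.0 for v in u}
    theta = {1: 6.0, 2: 4.0, 3: 0.0, 4: 6.0, 5: 3.0}
    triplets = [(1, 2, 3), (1, 4, 5)]
    for policy in ("min_data", "max_data", "max_flow"):
        print(policy, sort_triplets(policy, triplets, u, d, theta))
```

Because the keys are computed once from the state at the start of the sort, this sketch does not re-rank triplets as data is generated within the time step; whether to do so is a design choice the text leaves open.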

VII. Experimental results 

† Node $k$ is the destination of a triplet $c$, $c = (i, j) \vdash k$.

We have proposed four different policies for the source
traffic scheduling problem and four policies for the triplets
sorting problem. However, due to space limitations, we report
for each case only the two best policies (in terms of achieved
throughput). At the source, the random and minimum data
policies consistently outperform the others, and at the
storage nodes, the maximum flow and minimum data sorting
policies for the triplets likewise outperform the others. We
will refer to each of the combinations as follows:



Policy Name    Source Policy    In-Network Policy
RndFlw         random           maximum flow
RndDta         random           minimum data
MinFlw         minimum data     maximum flow
MinDta         minimum data     minimum data



It is interesting to note that the minimum data policy ob- 
tains good storage throughput in both cases, which leads us 
to infer that in general, prioritizing redundancy generation 
in those nodes that have received less data is a good strategy 
to maximize the throughput of the backup process. 

A. Setting 

We considered an $(n = 7, k = 3)$-HSRC code, which is a
code that can achieve a static data resiliency similar to
3-way replication while requiring a redundancy factor of
only $7/3 \approx 2.33$ [13]. Using this erasure code we simulated
various backup processes with different node (un)availability
patterns for a fixed number of time steps $t$. In all the
simulated cases we consider three different metrics:

(i) The maximum amount of data that can be stored in $t$
time steps, $M(t)$.

(ii) The amount of data the source node uploads per unit of
useful data backed up.

(iii) The total traffic generated per unit of useful data
stored, $T(f,t)/M(t)$.

We evaluate the three metrics for a system using an in- 
network redundancy generation algorithm and we compare 
our results with a system using the naive erasure coding 
backup process, where the source uploads all the data 
directly to each storage node. Our results depict the savings 
and gains, in percentage, of using an in-network redundancy 
algorithm with respect to the naive approach. 

Figure 3. Results obtained by comparing the performance of a naive erasure code storage process and the in-network redundancy generation process
using the IM traces (a,b,c), KAD traces (d,e,f) and datacenter-like traces (g,h,i). For IM and KAD traces we consider 3 different cases where we filter out
nodes whose average availability is less than 4h, 6h and 12h per day. For the datacenter case we consider 4 different node average availability values,
namely 10%, 30%, 60% and 100%.

Regarding the (un)availability patterns of nodes and their
bandwidth constraints we consider two possible cases:
• A P2P-like environment where nodes have an upload
bandwidth uniformly distributed between 20 Kbps and
200 Kbps, and an asymmetric download bandwidth
equal to four times their upload bandwidth. Nodes in
this category follow two different availability traces
from real decentralized applications: (i) traces from users
of an instant messaging (IM) service, and (ii) traces
from P2P nodes in the aMule KAD DHT overlay [14].
In both cases we keep only the nodes that on average stay
online more than 4, 6 and 12 hours per day, obtaining
different mean availability scenarios. Finally, the time
step duration is set to $\tau = 1$ hour and we obtain the
results by averaging the results of 500 backup processes
of $t = 120$ time steps each (5 days).
• A datacenter-like scenario where nodes have a symmetric
upload/download bandwidth equal to 1 Gbps.
These nodes have availability sessions that follow an
exponential distribution with rate $\lambda_{on} = (2\mathrm{h} \times a)^{-1}$
and unavailability sessions that follow an exponential
distribution with rate $\lambda_{off} = (2\mathrm{h} \times (1 - a))^{-1}$, where
$a$ is the average online availability. We simulate 4
average online availabilities, namely, 10%, 30%, 60%
and 100%. In this case the time step duration is set to
$\tau = 5$ minutes and we obtain the results by averaging
the results of 500 backup processes of $t = 144$ time
steps each (12 hours).
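
A synthetic trace of this kind could be generated along the following lines. This is a sketch under assumptions not stated in the text: the state is sampled at the start of every time step, the initial state is drawn online with probability a, and the function and parameter names are hypothetical.

```python
import random


def availability_trace(a, steps=144, step_minutes=5.0, cycle_hours=2.0, rng=None):
    """Return a list of 0/1 flags, one per time step, for a single node.

    Online sessions are exponential with mean cycle_hours * a and offline
    sessions with mean cycle_hours * (1 - a), i.e. rates (2h * a)^-1 and
    (2h * (1 - a))^-1 for the default two-hour cycle.
    """
    rng = rng or random.Random()
    if a >= 1.0:
        return [1] * steps  # the offline rate is undefined for a = 1
    if a <= 0.0:
        return [0] * steps
    mean_on = cycle_hours * 60.0 * a          # minutes
    mean_off = cycle_hours * 60.0 * (1.0 - a)

    def session_length(is_online):
        return rng.expovariate(1.0 / (mean_on if is_online else mean_off))

    online = rng.random() < a                 # assumed initial state
    session_end = session_length(online)
    trace = []
    for s in range(steps):
        now = s * step_minutes
        while session_end <= now:             # advance past finished sessions
            online = not online
            session_end += session_length(online)
        trace.append(1 if online else 0)
    return trace


if __name__ == "__main__":
    for a in (0.1, 0.3, 0.6, 1.0):
        trace = availability_trace(a, rng=random.Random(42))
        print(a, sum(trace) / len(trace))
```

Since the mean online and offline session lengths are in the ratio a : (1 - a), the long-run fraction of online time is a, although a single 144-step trace can deviate from it noticeably.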

B. Results 

In Figures 3a, 3b, 3c and 3d, 3e, 3f we show results
based on F2F and P2P scenarios (using IM and KAD traces
respectively). Figures 3a and 3d show how the storage
throughput increases as nodes become more available on
average. This is due to the constraint in eq. (5) requiring
redundancy generation triplets to be symmetric, which
requires the three nodes involved in each triplet to be available
simultaneously. The higher the online availability, the higher
the chances of finding the three nodes of a triplet online.
Further, we observe that the RndFlw policy achieves
significantly better results than the other policies.

As noted previously (in Remark 3), the total traffic
required for in-network redundancy generation is twice that
needed by the traditional process. Figures 3b and 3e
confirm this observation. We additionally note that the increase
in traffic is approximately the same as or even less than the
increase in storage throughput, even for low availability (> 4h)
scenarios. Thus the in-network redundancy generation scales
well by achieving a better utilization of the available network
resources than the classical storage process.

In the traditional approach, the source needs to upload
$7/3 \approx 2.33$ times the size of the actual data to be stored;
$4/7 \approx 57\%$ of this data is redundant. Figures 3c and 3f
show the reduction of data upload at the source. In the best
case (> 12h traces and RndFlw policy) our approach reduces
the source's load by 40% (out of a possible 57%), yielding a
40-60% increase in storage throughput (Figures 3a and 3d).
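
The arithmetic behind these figures can be checked with a few lines. The last quantity relies on an interpretation that is an assumption of this sketch: the 40% reduction is read as relative to the total amount the source uploads in the naive process.

```python
n, k = 7, 3

redundancy_factor = n / k          # data uploaded per unit of useful data
redundant_share = (n - k) / n      # fraction of the upload that is redundant

reduction = 0.40                   # best-case saving reported in the text
# Assumed reading: the saving applies to the whole naive upload volume.
source_upload = (1 - reduction) * redundancy_factor

print(f"redundancy factor: {redundancy_factor:.2f}")
print(f"redundant share:   {redundant_share:.0%}")
print(f"source upload per unit of useful data after saving: {source_upload:.2f}")
```

Under this reading the source would upload about 1.4 times the object size instead of 2.33 times, while a saving of the full 57% would correspond to uploading the object exactly once.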

Figures 3g, 3h and 3i show results for the datacenter-like
scenario. When node availabilities are high, we note that
significant throughput gains can be achieved (up to 140%). It
is interesting to see how in the case of low node availability
(10%) the total amount of data that can be stored with the
in-network redundancy generation technique is less than with
the traditional storage process. This is an artefact of two
shortcomings: one with our scheduling algorithm, and one
with the synthetic trace we generated.

Finding three available nodes simultaneously is unlikely
when overall availability is low. To solve this problem, we
would need to look at more sophisticated in-network
redundancy generation strategies that are not subject to the
symmetric constraint (defined in eq. (5)), so that nodes can
forward and store partially-generated data. However, the
scheduling problem then becomes much more complicated, and
is beyond the reach of this first work. Furthermore, in real
traces node availabilities are correlated (e.g., because of
batch jobs); such correlations are missing in the synthetic
traces, yet they can be leveraged in practice. Exploring both
these aspects will be part of our future work.

VIII. Conclusions 

In this work we propose and explore how storage nodes
can collaborate among themselves to generate erasure-encoded
redundancy by leveraging the self-repairing property of
novel erasure codes, thus reducing the source node's load
and improving the overall throughput of data backup. We
demonstrate that finding an optimal schedule is computationally
prohibitive (even under simplifying assumptions), but
experiments based on heuristics yield significant gains in
storage throughput under diverse settings, demonstrating
their practicality.